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Abstract 

We introduce a new Bayesian model for hierarchical clustering based on a prior 
over trees called Kingman's coalescent. We develop novel greedy and sequential 
Monte Carlo inferences which operate in a bottom-up agglomerative fashion. We 
show experimentally the superiority of our algorithms over others, and demon- 
strate our approach in document clustering and phylolinguistics. 



1 Introduction 

Hierarchically structured data abound across a wide variety of domains. It is thus not surprising that 
hierarchical clustering is a traditional mainstay of machine learning [ 1 1. The dominant approach to 
hierarchical clustering is agglomerative: start with one cluster per datum, and greedily merge pairs 
until a single cluster remains. Such algorithms are efficient and easy to implement. Their primary 
limitations — a lack of predictive semantics and a coherent mechanism to deal with missing data — 
can be addressed by probabilistic models that handle partially observed data, quantify goodness-of- 
fit, predict on new data, and integrate within more complex models, all in a principled fashion. 

Currently there are two main approaches to probabilistic models for hierarchical clustering. The 
first takes a direct Bayesian approach by defining a prior over trees followed by a distribution over 
data points conditioned on a tree O [3] |4j [5]|. MCMC sampling is then used to obtain trees from 
their posterior distribution given observations. This approach has the advantages and disadvantages 
of most Bayesian models: averaging over sampled trees can improve predictive capabilities, give 
confidence estimates for conclusions drawn from the hierarchy, and share statistical strength across 
the model; but it is also computationally demanding and complex to implement. As a result such 
models have not found widespread use. [2 1 has the additional advantage that the distribution induced 
on the data points is exchangeable, so the model can be coherently extended to new data. The 
second approach uses a flat mixture model as the underlying probabilistic model and structures the 
posterior hierarchically |6][7). This approach uses an agglomerative procedure to find the tree giving 
the best posterior approximation, mirroring traditional agglomerative clustering techniques closely 
and giving efficient and easy to implement algorithms. However because the underlying model has 
no hierarchical structure, there is no sharing of information across the tree. 

We propose a novel class of Bayesian hierarchical clustering models and associated inference algo- 
rithms combining the advantages of both probabilistic approaches above. 1) We define a prior and 
compute the posterior over trees, thus reaping the benefits of a fully Bayesian approach; 2) the dis- 
tribution over data is hierarchically structured allowing for sharing of statistical strength; 3) we have 
efficient and easy to implement inference algorithms that construct trees agglomeratively; and 4) the 
induced distribution over data points is exchangeable. Our model is based on an exchangeable distri- 
bution over trees called Kingman's coalescent [ 8 , 9 1 . Kingman's coalescent is a standard model from 
population genetics for the genealogy of a set of individuals. It is obtained by tracing the genealogy 
backwards in time, noting when lineages coalesce together. We review Kingman's coalescent in 
Section[2] Our own contribution is in using it as a prior over trees in a hierarchical clustering model 
(Section|3]l and in developing novel inference procedures for this model (Section^. 
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Figure 1: (a) Variables describing the n-coalescent. (b) Sample path from a Brownian diffusion 
coalescent process in ID, circles are coalescent points, (c) Sample observed points from same in 
2D, notice the hierarchically clustered nature of the points. 



2 Kingman's coalescent 

Kingman's coalescent is a standard model in population genetics describing the common genealogy 
(ancestral tree) of a set of individuals EE). In its full form it is a distribution over the genealogy of 
a countably infinite set of individuals. Like other nonparametric models (e.g. Gaussian and Dirich- 
let processes), Kingman's coalescent is most easily described and understood in terms of its finite 
dimensional marginal distributions over the genealogies of n individuals, called n-coalescents. We 
obtain Kingman's coalescent as n— >oo. 

Consider the genealogy of n individuals alive at the present time t = 0. We can trace their ancestry 
backwards in time to the distant past t= — oo. Assume each individual has one parent (in genetics, 
haploid organisms), and therefore genealogies of [n] = {1, n} form a directed forest. In general, 
at time t < 0, there are m (1 < m < n) ancestors alive. Identify these ancestors with their correspond- 
ing sets p\, p m of descendants (we will make this identification throughout the paper). Note that 
n(t) = {pi, p m } form a partition of [n], and interpret ii— >7r(t) as a function from (— oo, 0] to the 
set of partitions of [n]. This function is piecewise constant, left-continuous, monotonic (s<t implies 
that w(t) is a refinement of n(s)), and 7r(0) = {{1}, {n}} (see Figure[T^). Further, n completely 
and succinctly characterizes the genealogy; we shall henceforth refer to ir as the genealogy of [n] . 

Kingman's n-coalescent is simply a distribution over genealogies of [n], or equivalently, over the 
space of partition-valued functions like n. More specifically, the n-coalescent is a continuous-time, 
partition-valued, Markov process, which starts at {{1}, . . . , {n}} at present time t — 0, and evolves 
backwards in time, merging (coalescing) lineages until only one is left. To describe the Markov 
process in its entirety, it is sufficient to describe the jump process (i.e. the embedded, discrete-time, 
Markov chain over partitions) and the distribution over coalescent times. Both are straightforward 
and their simplicity is part of the appeal of Kingman's coalescent. Let pu,p r i be the zth pair of 
lineages to coalesce, t n -i < ■ ■ ■ < t\ < to = be the coalescent times and Si = ti^i — ti > 
be the duration between adjacent events (see Figure [lj). Under the n-coalescent, every pair of 
lineages merges independently with rate 1 . Thus the first pair amongst m lineages merge with rate 

(™ ) = m 2 ■ Therefore Si ~Exp (( n- o +1 )) independently, the pair pu, p r i is chosen from among 
those right after time ti, and with probability one a random draw from the n-coalescent is a binary 
tree with a single root at t= ^do and the n individuals at time t = 0. The genealogy is given as: 

r{{l},...,{n}} if i = 0; 

n(t) = < 7T t4 _ 1 - pu - p ri + (pu U p ri ) \f t = U\ (1) 

[ir u ift i+1 < t <ti. 

Combining the probabilities of the durations and choices of lineages, the probability of ir is simply: 

PW = Ei 1 (-2 +1 ) ex P /{ n -l +1 ) = exp H"^ +1 H) (2) 

The n-coalescent has some interesting statistical properties EHU. The marginal distribution over 
tree topologies is uniform and independent of the coalescent times. Secondly, it is infinitely ex- 
changeable: given a genealogy drawn from an n-coalescent, the genealogy of any m contemporary 
individuals alive at time t < embedded within the genealogy is a draw from the m-coalescent. 
Thus, taking n — > oo, there is a distribution over genealogies of a countably infinite population 
for which the marginal distribution of the genealogy of any n individuals gives the n-coalescent. 
Kingman called this the coalescent. 
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3 Hierarchical clustering with coalescents 



We take a Bayesian approach to hierarchical clustering, placing a coalescent prior on the latent tree 
and modeling observed data with a Markov process evolving forward in time along the tree. We will 
alter our terminology from genealogy to tree, from n individuals at present time to n observed data 
points, and from individuals on the genealogy to latent variables on the tree-structured distribution. 
Let x\ , . . . , x n be n observed data at the leaves of a tree ir drawn from the n-coalescent. it has n—1 
coalescent points, the zth occuring when pu and p r i merge at time ti to form pi = pu U p r i. Let tu 
and t r i be the times at which pu and p rj ; are themselves formed. 

We construct a continuous-time Markov process evolving along the tree from the past to the present, 
branching independently at each coalescent point until we reach time 0, where the n Markov pro- 
cesses induce a distribution over the n data points. The joint distribution respects the conditional 
independences implied by the structure of the directed tree. Let y Pi be a latent variable that takes on 
the value of the Markov process at pi just before it branches (see Figure[lJ). Let y^ = Xi at leaf i. 

To complete the description of the likelihood model, let q(z) be the initial distribution of the Markov 
process at time t = — oo, and k st (x, y) be the transition probability from state x at time s to state y 
at time t. This Markov process need be neither stationary nor ergodic. Marginalizing over paths of 
the Markov process, the joint probability over the latent variables and the observations is: 

p(x, y, z\tt) = q{z)k^ 00 tn _ 1 (z, y Pn _J Km {Vpi , V P u) k Ut n {v Pi , V PH ) (3) 

Notice that the marginal distributions at each observation p(xi\ir) are identical and given by the 
Markov process at time 0. However, they are not independent: they share the same sample path down 
the Markov process until they split. In fact the amount of dependence between two observations is 
a function of the time at which the observations coalesce in the past. A more recent coalescent time 
implies larger dependence. The overall distribution induced on the observations p(x) inherits the 
infinite exchangeability of the n-coalescent. We considered a brownian diffusion (see Figures[TJb,c)) 
and a simple independent sites mutation process on multinomial vectors (Section |43j ). 

4 Agglomerative sequential Monte Carlo and greedy inference 

We develop two classes of efficient and easily implementable inference algorithms for our hierar- 
chical clustering model based on sequential Monte Carlo (SMC) and greedy schemes respectively. 
In both classes, the latent variables are integrated out, and the trees are constructed in a bottom-up 
fashion. The full tree tt can be expressed as a series of n — 1 coalescent events, ordered backwards 
in time. The ith coalescent event involves the merging of the two subtrees with leaves pu and p rL 
and occurs at a time Si before the previous coalescent event. Let 0i = {Sj , pij , p r j for j < i} denote 
the first i coalescent events. B n -\ is equivalent to it and we shall use them interchangeably. 

We assume that the form of the Markov process is such that the latent variables {y Pi }™=i and z can 
be efficiently integrated out using an upward pass of belief propagation on the tree. Let M p .(y) be 
the message passed from y Pi to its parent; (y) = 5 Xi (y) is point mass at Xi for leaf i. M Pi (y) 
is proportional to the likelihood of the observations at the leaves below coalescent event i, given that 
y p . = y. Belief propagation computes the messages recursively up the tree; for i = 1, .. . , n — 1: 

M P M = Z-^x, 9i) l\ b =i,r I k utjy, Vb)M Pbi {y b ) dy b (4) 

Z Pi (yi, 0i) is a normalization constant introduced to avoid numerical problems. The choice of Z 
does not affect the probability of x, but does impact the accuracy and efficiency of our inference 
algorithms. We found that Z Pi (x, 0$) = J q(y)M p . (y) dy worked well. At the root, we have: 

Z_oo(x,0 n _i) = / q(z)k- ootn _ 1 (z,y)M Pn _ 1 (y)dydz (5) 
The marginal probability p(x\tt) is now given by the product of normalization constants: 

p(x|tt) = Z_^(x, 0„_O nrJi 1 Z pi (x, di) (6) 
Multiplying in the prior |2]) over it, we get the joint probability for the tree it and observations x: 

p(x,7r) = Z_ 00 (x,^_ 1 )nr=i 1 exp(-("-* +1 )<5 i ) Z„(x,0,) (7) 
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Our inference algorithms are based upon Q. Note that each term Z Pi (x, 9{) can be interpreted as a 
local likelihood term for coalescing the pair pu, p r ^\ In general, for each i, we choose a duration Si 
and a pair of subtrees pu, p r i to coalesce. This choice is based upon the ith term in |7j, interpreted 
as the product of a local prior and a local likelihood for choosing Si, p ti and p ri given 6>j_i. 

4.1 Sequential Monte Carlo algorithms 

Sequential Monte Carlo algorithms (aka particle filters), approximate the posterior using a weighted 
sum of point masses 1 10 1 . These point masses are constructed iteratively. At iteration particle 
s consists of 9f_ l — {<5|, pf., p s r ^ for j < i}, and has weight wf_ 1 . At iteration i, s is extended by 
sampling 6", p s u and p s ri from a proposal distribution fi(Sf, p s Ul p^il^f-i)' with weights: 

< = <_ x exp (-("^ +1 K') Z Pi (x, 91) I m, p s H ,p s ri \6U) (8) 

After n — 1 iterations, we obtain a set of trees and weights u>£_i. The joint distribution 

is approximated by: p(ir,x) w u; n-i^e s „ 1 ( 7r )' while the posterior is approximated with the 
weights normalized. An important aspect of SMC is resampling, which places more particles in 
high probability regions and prunes particles stuck in low probability regions. We resample as in 
Algorithm 5.1 of (Tl | when the effective sample size ratio as estimated in [12] falls below one half. 

SMC-PriorPrior. The simplest proposal distribution is to sample 6*, p H and p s ri from the local 
prior. 5f is drawn from an exponential with rate (™~2 +1 ) an d Put Pri are drawn uniformly from 
all available pairs. The weight updates ^ reduce to multiplying by Z p .(x, 9f). This approach is 
computationally very efficient, but performs badly with many objects due to the uniform draws over 
pairs. SMC-PriorPost. The second approach addresses the suboptimal choice of pairs to coalesce. 
We first draw <5| from its local prior, then draw p s H , p s ri from the local posterior: 

fi(pli,Pri\ 5 i^i-l) « ZpfaOi-l&^lvPUY, Wf = wt_ 1 Y,p>,p> r Z P^ d !-l> S !>Pl>Pr) (9) 

This approach is more computationally demanding since we need to evaluate the local likelihood of 
every pair. It also performs significantly better than SMC-PriorPrior. We have found that it works 
reasonably well for small data sets but fails in larger ones for which the local posterior for Si is highly 
peaked. SMC-PostPost. The third approach is to draw all of Sf, pf i and p s ri from their posterior: 

fm,Pu,Pri\n-i) « ex P (-( n i +i K) Zp^u^i^^p^) 

< = <-i H p[ . K I ex P (-("1 +1 )<5') Z w (x, 9i_ 1 ,5>, p[, p' r ) dS' (10) 

This approach requires the fewest particles, but is the most computationally expensive due to the 
integral for each pair. Fortunately, for the case of Brownian diffusion process described below, these 
integrals are tractable and related to generalized inverse Gaussian distributions. 

4.2 Greedy algorithms 

SMC algorithms are attractive because they produce an arbitrarily accurate approximation to the full 
posterior. However in many applications a single good tree is often times sufficient. We describe a 
few greedy algorithms to construct a good tree. 

Greedy-MaxProb: the obvious greedy algorithm is to pick Si, pu and p„ maximizing the zth term 
in Q. We do so by computing the optimal Si for each pair of pu, p r i, and then picking the pair 
maximizing the ith term at its optimal Si. Greedy-MinDuration: simply pick the pair to coalesce 
whose optimal duration is minimum. Both algorithms require recomputing the optimal duration for 
each pair at each iteration, since the exponential rate ("~2 +1 ) on the duration varies with the iteration 
i. The total computational cost is thus 0(n 3 ). We can avoid this by using the alternative view of the 
rt-coalesent as a Markov process where each pair of lineages coalesces at rate 1 . Greedy-Ratel : for 
each pair pu and p r i we determine the optimal Si, but replacing the ("~2 +1 ) P r i° r rate with 1. We 
coalesce the pair with most recent time (as in Greedy-MinDuration). This reduces the complexity to 
0(n 2 ). We found that all three perform about equally well. 

'if the Markov process is stationary with equilibrium q(y), Z Pi (x,9i) is a likelihood ratio between two 
models with observations x Pi : (1) a single tree with leaves pi\ (2) two independent trees with leaves pu and p r i 
respectively. This is similar to (6||7) an d is used later in our NIPS experiment to determine coherent clusters. 
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4.3 Examples 



Brownian diffusion. Consider the case of continuous data evolving via Brownian diffusion. The 
transition kernel k st (y, •) is a Gaussian centred at y with variance (t — s)A, where A is a symmetric 
p.d. covariance matrix. Because the joint distribution ^ over x, y and z is Gaussian, we can express 
each message M Pi (y) as a Gaussian with mean y Pi and variance Av Pi . The local likelihood is: 

Z Pi {x,6i) = \2nAi\-i exp(-| \\y PH -y P „\\\y, A, = A(v PH +v Prt +t H +t„~2t l ) (11) 

where \\x\\y = x T, $>~ 1 x is the Mahanalobis norm. The optimal duration Si can also be solved for, 

5i = 4 (n-i+i) (\A("~2 +1 ) \\ym-V P J\ 2 A + D 2 -D) - \(v Pu +v PTi +tu+t„-2U^) (12) 

where D is the dimensionality. The message at the newly coalesced point has mean and covariance: 

vpi = ((««« + tu - U)- 1 + (v Pri + U - U)- 1 )' 1 ;^ = { v J + Z-u + vJtti-u W ( 13 > 

Multinomial vectors. Consider a Markov process acting on multinomial vectors with each entry 
taking one of K values and evolving independently. Entry d evolves at rate X c i and has equilibrium 
distribution vector qd- The transition rate matrix is Qd — XdiqJ^K — Ik) where Ik is a vector of 
K ones and Ik is identity matrix of size K, while the transition probability matrix for entry d in 
a time interval of length t is e® dt — e~ Xdt lK + (1 — e~~ Xdt )qJlK- Representing the message for 
entry d from pi to its parent as a vector M p . = [M p ^ , M p ^] T , normalized so that qd ■ M p . = 1, 
the local likelihood terms and messages are computed as, 

Z% (x, 9i) = 1 - e M»«-*"-*H) (i _ £f =1 qdk M^M^ t ) (14) 

Mf 4 = (1 - e^-(«i-*i*)(i - M d pH )){\ - e A ^*--*")(l - M p d r .))/^ i (x,6» i ) (15) 
Unfortunately the optimal 6} cannot be solved analytically and we use Newton steps to compute it. 

4.4 Hyperparameter estimation and predictive density 

We perform hyperparameter estimation by iterating between estimating a geneology, then re- 
estimating the hyperparamters conditioned on this tree. Space precludes a detailed discussion of 
the algorithms we use; they can be found in the supplemental material. In the Brownian case, we 
place an inverse Wishart prior on A and the MAP posterior A is available in a standard closed form. 
In the multinomial case, the updates are not available analytically and must be solved iteratively. 

Given a tree and a new individual y' we wish to know: (a) where y' might coalescent and (b) what 
the density is at y'. In the supplemental material, we show that the probability that y' merges at 
time t with a given sibling is available in closed form for the Brownian motion case. To obtain the 
density, we sum over all possible siblings and integrate out t by drawing equally spaced samples. 

5 Experiments 

Synthetic Data Sets In Figure|2]we compare the various SMC algorithms and Greedy- Rate 1^] on 
a range of synthetic data sets drawn from the Brownian diffusion coalescent process itself (A = Id) 
to investigate the effects of various parameters on the efficacy of the algorithms. Generally SMC- 
PostPost performed best, followed by SMC-PriorPost, SMC-PriorPrior and Greedy-Ratel . With 
increasing D the amount of data given to the algorithms increases and all algorithms do better, 
especially Greedy-Ratel . This is because the posterior becomes concentrated and the Greedy-Ratel 
approximation corresponds well with theposterior. As n increases, the amount of data increases 
as well and all algorithms perform bette^J However, the posterior space also increases and SMC- 
PriorPrior which simply samples from the prior over genealogies does not improve as much. We 
see this effect as well when S is small. As S increases all SMC algorithms improve. Finally, the 
algorithms were surprisingly robust when there is mismatch between the generated data sets' A and 
the A used by the model. We expected all models to perform worse with SMC-PostPost best able to 
maintain its performance (though this is possibly due to our experimental setup). 

2 We found in unreported experiments that the greedy algorithms worked about equally well. 
3 Each panel was generated from independent runs. Data set variance affected all algorithms, varying overall 
performance across panels. However, trends in each panel are still valid, as they are based on the same data. 
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Figure 2: Predictive performance of algorithms as we vary (a) the numbers of dimensions D, (b) 
observations n, (c) the mutation rate A (A = \Id), and (d) number of samples S. In each panel 
other parameters are fixed to their middle values (we used S = 50) in other panels, and we report 
log predictive probabilities on one unobserved entry, averaged over 100 runs. 





MNIST 

Avg-link BHC Coalescent 


SPAMBASE 
Avg-link BHC Coalescent 


Purity 

Subtree 

LOO-acc 


.363±.004 .392±.006 .412±.006 
.581±.005 .579±.005 .610±.005 
.755±.005 .763±.005 .773±.005 


.616±.007 .711±.010 .689±.008 
.607±.011 .549±.015 .661±.012 
.846±.010 .832±.010 .861±.008 



Table 1: Comparative results. Numbers are averages and standard errors over 50 and 20 repeats. 



MNIST and SPAMBASE We compare the performance of our approach (Greedy-Ratel with 
10 iterations of hyperparameter update) to two other hierarchical clustering algorithms: average- 
link agglomerative clustering and Bayesian hierarchical clustering 0. In MNIST, We use 10 
digits from the MNIST data set, 20 examplars for each digit and 20 dimensions (reduced via 
PCA), repeating the experiment 50 times. In SPAMBASE, we use 100 examples of 57 at- 
tributes each from 2 classes, repeating 20 times. We present purity scores J6j, subtree scores 
(^{interior nodes with all leaves of same class}/(n — ^classes)) and leave-one-out accuracies (all 
scores between and 1, higher better). The results are in Table[T} as we can see, except for purity on 
SPAMBASE, ours gives the best performance. Experiments not presented here show that all greedy 
algorithms perform about the same and that performance improves with hyperparameter updates. 

Phylolinguistics We apply our approach (Greedy-Ratel) to a phylolinguistic problem: language 
evolution. Unlike previous research [ 1 3 1 which studies only phonological data, we use a full 
typological database of 139 binary features over 2150 languages: the World Atlas of Language 
Structures (henceforth, "WALS") fl4"l . The data is sparse: about 84% of the entries are unknown. 
We use the same version of the database as extracted by [ 15]. Based on the Indo-European subset of 
this data for which at most 30 features are unknown (48 language total), we recover the coalescent 
tree shown in Figure |3ja). Each language is shown with its genus, allowing us to observe that it 
teases apart Germanic and Romance languages, but makes a few errors with respect to Iranian and 
Greek. (In the supplemental material, we report results applied to a wider range of languages.) 

Next, we compare predictive abilities to other algorithms. We take a subset of WALS and tested on 
5% of withheld entries, restoring these with various techniques: Greedy-Ratel; nearest neighbors 
(use value from nearest observed neighbor); average-linkage (nearest neighbor in the tree); and 
probabilistic PCA (latent dimensions in 5, 10,20,40, chosen optimistically). We use five subsets 
of the WALS database of varying size, obtained by sorting both the languages and features of the 
database according to how many cells are observed. We then use a varying percentage (10% — 50%) 
of the densest portion. The results are in Figure |3jb). The performance of PPCA is steady around 
76%. The performance of the other algorithms degrades as the sparsity incrases. Our approach 
performs at least as well as all the other techniques, except at the two extremes. 

NIPS We applied Greedy-Ratel to all NIPS abstracts through NIPS12 (1740, total). The data was 
preprocessed so that only words occuring in at least 100 abstracts were retained. The word counts 
were then converted to binary. We performed one iteration of hyperparameter re-estimation. In the 
supplemental material, we depict the top levels of the coalescent tree. Here, we use use the tree to 
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Setic Irish 
e tic Gaelic (Scots) 
Cetic Welsh 
Cetic Cornish 
Cetic Breton 
ranian] Tajik 
ranian Persian 
ranian Kurdish (Central) 

, _■ . French 

German 
Dutch 
English 
Icelandic 
Swedish 
Norwegian 
Danish 

eek^Modern) 



Romance 

e ermanic 
ermanic 
Germanic' 
Germanic' 

eermanic' 
ermanic 
Germanic' 
Romance 
~ Xa G 



Slavic] Bulgarian 



Romance 
Romance 
Romance' 
Romance' 



banian] t 
avic] Pol 
avic 
avic 
avic 
avic 



Baltic 



Homanian 
Portuguese 
Italian 
Catalan 
Albanian 
. jlish 
Slovene 

Serbian-Croatian 
Ukrainian 
Russian 
Lithuanian 



Baltic] Latvian 
Slavic] Czech 
ranianl Pashto 
ndic 
ndic 
ndic 
ndic 
ndic 
ranian] 
ndic 
ndic 
ndic 



Hfn n cif bi 
Kashmiri 
Sinhala 
Nepali 
i] Ossetic 
Maithili 
Marathi 
... Bengali 
Armenian] Armenian (Western) 
Armenian Armenian (Eastern) 



(a) Coalescent for a subset of Indo-European lan- 
guages from WALS. 




0.2 0.3 0.4 0.5 

(b) Data restoration on WALS. Y-axis is accuracy; 
X-axis is percentage of data set used in experiments. 
At 10%, there are TV = 215 languages, H = 14 
features and p — 94% observed data; at 20%, TV = 
430, H = 28 and p = 80%; at 30%: TV = 645, 
H = 42 and p = 66%; at 40%: TV = 860, H = 
56 and p = 53%; at 50%: TV = 1075, H = 70 
and p = 43%. Results are averaged over five folds 
with a different 5% hidden each time. (We also tried 
a "mode" prediction, but its performance is in the 
60% range in all cases, and is not depicted.) 



Figure 3: Results of the phylolinguistics experiments. 



LLR (t) 


Top Words 


Top Authors 


32.7 (-2.71) 


bifurcation attractors hopfield network saddle 


Mjolsness (9) Saad (9) Ruppin (8) Coolen (7) 


0.106 (-3.77) 


voltage model cells neurons neuron 


Koch (30) Sejnowski (22) Bower (1 1) Dayan (10) 


83.8 (-2.02) 


chip circuit voltage vlsi transistor 


Koch (12) Alspector (6) Lazzaro (6) Murray (6) 


140.0 (-2.43) 


spike ocular cells firing stimulus 


Sejnowski (22) Koch (18) Bower (11) Dayan (10) 


2.48 (-3.66) 


data model learning algorithm training 


Jordan (17) Hinton (16) Williams (14) Tresp (13) 


31.3 (-2.76) 


infomax image ica images kurtosis 


Hinton (12) Sejnowski (10) Amari (7) Zemel (7) 


31.6 (-2.83) 


data training regression learning model 


Jordan (16) Tresp (13) Smola (11) Moody (10) 


39.5 (-2.46) 


critic policy reinforcement agent controller 


Singh (15) Barto (10) Sutton (8) Sanger (7) 


23.0 (-3.03) 


network training units hidden input 


Mozer (14) Lippmann (11) Giles (10) Bengio (9) 



Table 2: Nine clusters discovered in NIPS abstracts data. 



generate a flat clustering. To do so, we use the log likelihood ratio at each branch in the coalescent 
to determine if a split should occur. If the log likelihood ratio is greater than zero, we break the 
branch; otherwise, we recurse down. On the NIPS abstracts, this leads to nine clusters, depicted 
in Table [2] Note that clusters two and three are quite similar — had we used a slighly higher log 
likelihood ratio, they would have been merged (the LLR for cluster 2 was only 0.105). Note that 
the clustering is able to tease apart Bayesian learning (cluster 5) and non-bayesian learning (cluster 
7) — both of which have Mike Jordan as their top author! 



6 Discussion 

We described a new model for Bayesian agglomerative clustering. We used Kingman's coalescent 
as our prior over trees, and derived efficient and easily implementable greedy and SMC inference 
algorithms for the model. We showed empirically that our model gives better performance than other 
agglomerative clustering algorithms, and gives good results on applications to document modeling 
and phylolinguistics. 

Our model is most similar in spirit to the Dirichlet diffusion tree of 1 2 1 . Both use infinitely exchange- 
able priors over trees. While [2 1 uses a fragmentation process for trees, our prior uses the reverse — a 
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coalescent process instead. This allows us to develop simpler inference algorithms than those in 
0, though it will be interesting to consider the possibility of developing analogous algorithms for 
. J3 | also describes a hierarchical clustering model involving a prior over trees, but his prior is 
not infinitely exchangeable. [5| uses tree-consistent partitions to model relational data; it would be 
interesting to apply our approach to their setting. Another related work is the Bayesian hierarchical 
clustering of [6|, which uses an agglomerative procedure returning a tree structured approximate 
posterior for a Dirichlet process mixture model. As opposed to our work O uses a fiat mixture 
model and does not have a notion of distributions over trees. 

There are a number of unresolved issues with our work. Firstly, our algorithms take 0(n 3 ) compu- 
tation time, except for Greedy- Rate 1 which takes 0(n 2 ) time. Among the greedy algorithms we see 
that there are no discernible differences in quality of approximation thus we recommend Greedy- 
Ratel. It would be interesting to develop SMC algorithms with 0(n 2 ) runtime. Secondly, there 
are unanswered statistical questions. For example, since our prior is infinitely exchangeable, by de 
Finetti's theorem there is an underlying random distribution for which our observations are i.i.d. 
draws. What is this underlying random distribution, and how do samples from this distribution look 
like? We know the answer for at least a simple case: if the Markov process is a mutation process 
with mutation rate a/2 and new states are drawn i.i.d. from a base distribution H, then the induced 
distribution is a Dirichlet process DP(a,H) |8|. Another issue is that of consistency — does the 
posterior over random distributions converge to the true distribution as the number of observations 
grows? Finally, it would be interesting to generalize our approach to varying mutation rates, and to 
non-binary trees by using generalizations to Kingman's coalescent called A-coalescents [16|. 
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